Prefill batching logic to handle chunked prefill/prefix caching for HPU #753
Conversation
Due to HPU padding constraints, batching prefill requests with existing history (ctx != 0) causes excessive memory usage: the entire batch must be padded to the longest context, which leads to OOM. This patch enforces a batch size of 1 for prefill operations when ctx != 0. Although this sacrifices some throughput in corner cases, it effectively eliminates the OOM risk. Signed-off-by: Tony Lin <tony.lin@intel.com>
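For illustration, here is a minimal sketch of the batching rule described above, not the PR's actual diff. `Request`, `num_computed_tokens`, and `form_prefill_batch` are hypothetical stand-in names, assuming `num_computed_tokens > 0` corresponds to ctx != 0:

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    num_computed_tokens: int  # ctx: tokens already cached (prefix cache / chunked prefill)
    num_prompt_tokens: int

def form_prefill_batch(waiting: list[Request], max_batch_size: int) -> list[Request]:
    """Greedily batch ctx == 0 prefills in FIFO order; run ctx != 0 prefills alone."""
    batch: list[Request] = []
    for req in waiting:
        if req.num_computed_tokens > 0:
            # ctx != 0: padding the whole batch to this request's context
            # length risks OOM on HPU, so schedule it with batch size 1.
            return batch if batch else [req]
        batch.append(req)
        if len(batch) == max_batch_size:
            break
    return batch
```

For example, given waiting requests `[a(ctx=0), b(ctx=512), c(ctx=0)]`, the first call returns `[a]`, and `b` then runs alone in the next scheduling round, so no batch is ever padded out to `b`'s 512-token context.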
✅ CI Passed. All checks passed successfully against the following vllm commit:
xuechendi
left a comment
LGTM. @adobrzyn, could you take a second review?
@hlin99, hmm, on second thought I am a little unsure how these changes impact unified attention. Will need @kzawora-intel to check.
Sure. I wasn't aware UA goes down the same code path. If UA can overcome the HPU padding constraint, we can definitely split the code into UA and non-UA paths (see the sketch below). @kzawora-intel, please advise. Thanks.
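If unified attention does turn out to tolerate mixed context lengths, the split discussed above might look roughly like this hypothetical sketch, reusing the `form_prefill_batch` helper from the earlier example; the `unified_attention` flag is an assumption, not an existing vLLM option:

```python
def select_prefill_batch(waiting: list[Request], max_batch_size: int,
                         unified_attention: bool = False) -> list[Request]:
    if unified_attention:
        # Assumes UA handles heterogeneous context lengths without the
        # padding blow-up, so batching normally up to the limit is safe.
        return waiting[:max_batch_size]
    # Otherwise apply this patch's rule: ctx != 0 prefills run with batch size 1.
    return form_prefill_batch(waiting, max_batch_size)
```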